KnCr: A Short-Text Narrow-Domain Sub-Corpus of Medline

Author

  • David Pinto
Abstract

Clustering short texts in narrow domains is one of the most difficult tasks because of the high vocabulary overlap among the texts and the specific terminology used by researchers. Here, we present a new corpus of scientific texts in the medical domain, specifically about “Cancer” topics. This corpus is a subset of the last MEDLINE sample, made up of 900 abstracts in 16 different categories. The compilation is provided as a dataset for evaluating algorithms in this area. Preliminary experiments carried out with this corpus highlight its difficulty and reinforce the hypothesis of using it in this challenging...


Similar resources

A Self-enriching Methodology for Clustering Narrow Domain Short Texts

s of Scientific Texts Using the Transition Point Technique. Proc. CICLing Conference (CICLing’06), Mexico City, Mexico, February 19–25, Lecture Notes in Computer Science 3878, pp. 536–546. Springer, Berlin. [24] Alexandrov, M., Gelbukh, A. and Rosso, P. (2005) An Approach to Clustering Abstracts. Proc. 10th Int. Conf. Application of Natural Language to Information Systems (NLDB’05), Alicante, S...


Density-based clustering of short-text corpora (Agrupamiento de textos cortos basado en densidad)

In this work, we analyse the performance of different density-based algorithms on short-text and narrow-domain short-text corpora. We attempt to determine to what extent the features of this kind of corpus affect the density computation of the clusterings obtained, and how robust these algorithms are to the different complexity levels.


BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text

This paper describes a natural language processing system developed for the automatic identification of explicit connectives, their senses, and their arguments. Prior work has shown that differences in the usage of connectives across corpora negatively affect the cross-domain connective identification task. Hence the development of domain-specific discourse parsers has become indispensable. Here, we present a...


Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture

The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL): a list of the most frequent words from a corpus of 3,458,445 tokens made up of the 800 most recent pharmacy texts, including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...


Question answering in biomedicine

Recent developments in Question Answering have stayed with open-domain questions and collections, which are sometimes argued to be more difficult than narrow, domain-focused questions and corpora. The biomedical field is indeed a specialized domain; however, its scope is fairly broad, so that considering a biomedical QA task is not necessarily such a simplification over open-domain QA as represented ...



Publication date: 2006